How to analyse scover outputΒΆ

how_to_analyse_scover_output.utf8

Introduction

This vignette focusses on downstream analysis of the output from scover. After running scover, it produces several outputs. The most important are:

  1. The alignments of the convolutional kernels to a database of motifs, with significance (q-)scores for each alignment
  2. The influence scores of the convolutional kernels in the neural network, representing the change in prediction when a particular motif is left out

The central part of scover trains k (default: 10) different neural network models, each using a different 90% of the dataset for training and the remaining 10% for testing. Each model contains d (default: 300) convolutional filters, so there are a total of kd (default: 3,000) motifs. To ensure that the results are robust, scover groups similar motif and then identifies motif clusters that show up at least k/2 models.

The motifs are named as MODELNUMBER_MOTIFNUMBER, e.g.Β motif 2_3 is motif 3 from model 2. The best performing out of 10 models is named β€œbest” rather than its model number, e.g.Β motif best_129.

All of the motifs are aligned with Tomtom, the motif comparison tool from the MEME Suite, against a database of known motifs. This generates a tab-separated file that shows how the detected motifs aligned to the reference motifs and the q-values for the alignments. This result can be represented as a bipartite graph where the model motifs and the database motifs represent the two categories of nodes. Using this graph we can select reproducible motifs. For a visual overview, see the figure below.

scover_output_workflow

One of the outputs from the model is a HDF5 file containing motif influence scores for each model. The scores represent the mean change in prediction for a particular pool when a convolutional filter is left out and they are calculated for each convolutional filter in each pool.

Below is a schematic overview of scover’s workflow for motif analysis. The downstream analysis described in this section starts at the third step, where we have obtained influence scores from the models and the motif alignments from Tomtom.

scover_output_workflow

Load packages

I use the following R packages during this guide (see the sessionInfo() at the end of this guide for an overview of all the loaded packages):

suppressMessages(library(rhdf5))
suppressMessages(library(mixtools))
suppressMessages(library(SingleCellExperiment))
suppressMessages(library(dplyr))
suppressMessages(library(magrittr))
suppressMessages(library(reshape2))
suppressMessages(library(igraph))
suppressMessages(library(stringr))
suppressMessages(library(ggplot2))
suppressMessages(library(ggthemes))
suppressMessages(library(gghalves))
suppressMessages(library(ggrepel))
suppressMessages(library(pheatmap))
suppressMessages(library(plotly))
suppressMessages(library(tidytext))

options(stringsAsFactors = F)

I’ve added some custom functions to the following file (find it in the resources folder of this repository):

source("scover_helper_functions.R")

Set paths

For the rest of the analysis, I am using paths to the network output (exp_folder), the input dataset (data_path), and the cell type annotation (colData_path). I also need to input the motif database that was used for motif alignment (motif_db_path). I will also create an output directory for the analysis. In this vignette we will consider the human fetal and adult kidney dataset from Stewart et al that was analyzed in our manuscript.

# Network output dir
exp_folder <- "scover_output/"

# Network input datasets
data_path <- "data/20200825_combined_mature_fetal_kidney_pool120_5u5d.tsv.gz"
colData_path <- "data/20200825_combined_mature_fetal_kidney_pool120_5u5d_colData.tsv"

# Motif alignment db
motif_db_path <- "data/Homo_sapiens.meme"

# Output directory
outdir <- "analysis"
dir.create(outdir)
## Warning in dir.create(outdir): 'analysis' already exists

Read data

First, I will use the custom functions to read the input data, the Tomtom output, and the network influence scores. Then I will filter the Tomtom output to only include TFs that were expressed.

sce <- scover_dataset_to_sce(data_path = data_path, 
                      colData_path = colData_path)

tomtom_table <- read_tomtom_output(motif_db_path = motif_db_path,
                                    exp_folder = exp_folder)

influence_scores <- read_network_hdf5(exp_folder, sce)

# Filter the tomtom table to only include expressed TFs
tomtom_table <- filter_tomtom_table(tomtom_table = tomtom_table, sce = sce)

Graph analysis

Now I will generate a motif alignment graph and plot the graph:

alignment_graph <- create_alignment_graph(tomtom_table)

# Cluster alignment graph
alignment_graph_clusters <- igraph::cluster_walktrap(alignment_graph)
alignment_graph_cluster_groups <- igraph::groups(alignment_graph_clusters)

# Plot two alignment graphs
plot_alignment_graph(outdir, alignment_graph, alignment_graph_clusters)

This is an example output from the last function:

scover_alignment_plot

To get the motif clusters into a dataframe, I use a custom function (from scover_helper_functions.R):

cluster_df <- construct_cluster_df(exp_folder = exp_folder, 
                                   alignment_graph_cluster_groups = 
                                     alignment_graph_cluster_groups)

I also want to construct a matrix where the motif influence scores are averaged per cell type:

average_influence_matrix <- construct_average_influence_matrix(influence_scores)

I use this function to get information about the motifs in this matrix. This will also plot the motif annotation as β€œgood” or β€œbad”:

(If there are two groups of motifs, one with very very low impact scores and one with higher impact scores, then the group with very very low impact scores will be marked as β€œbad”)

motif_stats <- get_motif_stats(cluster_df = cluster_df,
                               average_influence_matrix = average_influence_matrix, 
                               pseudocount = 1e-11)
## number of iterations= 191 
## No clear bimodal distribution: likely not a lot of 'dead motifs'
## 
## Attaching package: 'grid'
## The following object is masked from 'package:mixtools':
## 
##     depth
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The next step will select β€˜good motifs’ based on impact, motif cluster reproducibility, and motif cluster annotation (only the clusters that aligned to known motifs will be included in this analysis):

# Get names of good motifs: the ones that have high impact, are from 
# reproducible motif clusters, and are aligned to known motifs
motif_stats <- motif_stats[motif_stats$cluster_annot != "NA",]

good_motifs <- rownames(motif_stats)[motif_stats$is_good_motif=="Yes" &
                                       as.numeric(motif_stats$cluster_reprod) >= 0.5]

influence_scores <- influence_scores[good_motifs,]
motif_stats <- motif_stats[good_motifs,]
average_influence_matrix <- average_influence_matrix[good_motifs,]

Plotting the motif influence scores across cell types

If we were to plot the average influence scores of the motifs across the cell types, it would look something like this:

# Only use this function if you have 10 or fewer motif clusters at this point.
annotation_colors <- get_annotation_colors(motif_stats)

# Set colors for heatmaps here
heatmap_colors <- colorRampPalette(c("magenta", "black", "yellow"))(100)

# The "drop = FALSE" will make sure the annotation stays a dataframe
pheatmap(average_influence_matrix, 
         annotation_row = motif_stats[,c("cluster_annot"), drop = FALSE], 
         show_rownames = FALSE, angle_col = 45, 
         annotation_colors = annotation_colors, 
         cellwidth = 8, 
         cellheight = .4,
         color = heatmap_colors)

Here, each row corresponds to a convolutional filter and each column corresponds to a cell type.

As you can see, a small subset of the motifs (near the bottom) has the largest impact. If we instead plot Z-transformed motif influence scores, it looks like this:

# Use custom function to get 
average_influence_matrix_z_scores <- to_z(average_influence_matrix)

pheatmap(average_influence_matrix_z_scores, 
         annotation_row=motif_stats[,c("cluster_annot"), drop=FALSE], 
         show_rownames = FALSE, angle_col = 45, 
         annotation_colors = annotation_colors, 
         cellwidth = 8, 
         cellheight = .4, 
         color = heatmap_colors)

Already you can see that the motif clusters form more coherent groups. It is important to keep in mind that the relative impacts of motifs in a single cell type cannot be deduced from this heatmap, as the values have been transformed per row. What this does show is the general β€˜trend’ of the influence scores of a motif across the cell types.

Z-scores can, however, inflate the differences between cell types. This is why it is beneficial to have another score that tells you something about how β€œcell-type-specific” a motif influence score really is. Towards this end, the β€œmotif cell type entropy” score has been calculated for each motif. It is defined as the normalised Shannon entropy of a motif influence score across cell types. A low motif cell entropy score suggests that the influence score distribution is non-uniform.

Let’s plot the motifs along with the entropy scores. I will also order the motifs for their motif cluster annotation. This will give us a better overview of how motif clusters vary across cell types.

pheatmap(average_influence_matrix_z_scores[order(motif_stats$cluster_annot),], 
         annotation_row=motif_stats[,c("cluster_annot", 
                                       "motif_celltype_entropy")],
         show_rownames = FALSE, angle_col = 45, 
         annotation_colors = annotation_colors, 
         cluster_rows = FALSE, 
         cellwidth = 8, 
         cellheight = .4, 
         color = heatmap_colors)

Better cell type annotation

Many datasets will have a large number of cell types, and it is often beneficial to group the cell types into a smaller number of β€œcell categories”. For this, I have created (for this specific example) a named vector with the different cell types as names that contains the corresponding cell type category:

Category_table <- readRDS("data/Category_table.RDS")
print(head(Category_table))
##               Fetal B cell       Fetal Cap mesenchyme 
##                   "Immune"       "Nephron progenitor" 
##           Fetal CD4 T cell                 Fetal cDC2 
##                   "Immune"                   "Immune" 
## Fetal CNT/PC - proximal UB Fetal Distal renal vesicle 
##       "Nephron epithelium"       "Nephron progenitor"

To add this information to the sce object, you can use this:

sce$Category <- Category_table[sce$cell_type1]

Refining motif cluster annotations

The current motif cluster annotation is not very specific, but this is a consequence of the fact that most motifs can be bound by multiple TFs making it impossible to tell which one is binding. Nevertheless, if expression information is available we can take advantage of it to identify the most likely candidate. The underlying assumption here is that TFs with a higher (absolute) correlation are more likely to be involved in the regulation. Please note that this is one of the more bespoke steps in this analysis and it is difficult to fully automate. For better control of the steps in this analysis, I recommend looking at the corresponding functions in scover_helper_functions.R.

The following code will produce plots in the outdir that show the expression pattern for the different cluster candidates, including a Spearman correlation score for the different cluster candidate TFs. In addition, it will generate one directory per motif cluster that contain plots showing how the expression and influence values are related. It will also return a DataFrame that contains the relevant information.

cluster_correlations_df <- plot_candidate_tf_information(cluster_df = cluster_df, 
                              influence_matrix = influence_scores, 
                              motif_stats = motif_stats, 
                              sce = sce, 
                              tomtom_table = tomtom_table)

As an example we consider the cluster with YY1, YY2, and THAP1. One of the figures generated by the above function shows the expression levels of the different genes in the family across cell types:

scover_output_workflow

YY1 has a higher expression than YY2, making it a more likely regulator than YY2. In addition, on the left we can see that YY1 and YY2 have a higher correlation to the motif weights (green) and more Tomtom alignments (purple) than THAP1, making them the more likely candidates.

Let’s take a closer look at the correlation plots for YY1 and YY2:

YY1 YY2

Here, the cell type categories are annotated by color.

We can now update motif cluster annotations using the correlation information. Sometimes it is hard to identify a single transcription factor, in which case I annotate the cluster with the motif family names. Sometimes it can be good to look at individual motifs in the motif cluster and look directly at the alignments in Tomtom; for this, see the all_tomtom/tomtom.html webpage generated by Tomtom.

I do this using the following code (I look at View(cluster_df[cluster_df$cluster_reproducibility >= 0.5,]) to see which number corresponds to which cluster):

cluster_df$cluster_annot[rownames(cluster_df) == "7"] <- "ETS motif family"
cluster_df$cluster_annot[rownames(cluster_df) == "11"] <- "ZBTB33/BRCA1"
cluster_df$cluster_annot[rownames(cluster_df) == "2"] <- "EGR/KLF motif families"
cluster_df$cluster_annot[rownames(cluster_df) == "9"] <- "YY1"
cluster_df$cluster_annot[rownames(cluster_df) == "3"] <- "bZIP motif family"
cluster_df$cluster_annot[rownames(cluster_df) == "1"] <- "ZFX"

# Update the annotation using the updated "cluster_df"
motif_stats <- get_motif_stats(cluster_df = cluster_df,
                               average_influence_matrix = average_influence_matrix, 
                               pseudocount = 1e-11)
## number of iterations= 168 
## No clear bimodal distribution: likely not a lot of 'dead motifs'
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

annotation_colors <- get_annotation_colors(motif_stats)
pheatmap(average_influence_matrix_z_scores[order(motif_stats$cluster_annot),], 
         annotation_row=motif_stats[,c("cluster_annot", 
                                       "motif_celltype_entropy")],
         show_rownames = FALSE, angle_col = 45, 
         annotation_colors = annotation_colors, 
         cluster_rows = FALSE, 
         cellwidth = 8, 
         cellheight = .4,
         color = heatmap_colors)

Plotting all motif family correlations

It is helpful to be able to view the correlations for all TFs related to the motif families reported by the network. Using the following function, you can plot the correlations and expression values for the different motif families, while annotating the top top_n hits for each motif family:

# First I update the motif family names:
cluster_correlations_df$cluster_annot[cluster_correlations_df$cluster == "7"] <- "ETS motif family"
cluster_correlations_df$cluster_annot[cluster_correlations_df$cluster == "11"] <- "ZBTB33/BRCA1"
cluster_correlations_df$cluster_annot[cluster_correlations_df$cluster == "2"] <- "EGR/KLF motif families"
cluster_correlations_df$cluster_annot[cluster_correlations_df$cluster == "9"] <- "YY1"
cluster_correlations_df$cluster_annot[cluster_correlations_df$cluster == "3"] <- "bZIP motif family"
cluster_correlations_df$cluster_annot[cluster_correlations_df$cluster == "1"] <- "ZFX"

plot_motif_family_correlation_overview(cluster_correlations_df, top_n = 5)

Aggregate influence scores for motif families

To get an idea of the influence score of a particular motif family, you can sum the influence scores of the motifs belonging to that family. Here, I cycle through the motif families and fill an empty matrix with the summed values of the corresponding motif influence scores:

# Create empty matrix
aggregate_motif_influences <- matrix(nrow=length(unique(motif_stats$cluster_annot)),
                                     ncol=ncol(influence_scores))
# Cycle through motif families
for(i in 1:nrow(aggregate_motif_influences)){
  current_motif_family <- unique(motif_stats$cluster_annot)[i]
  curr_motifs <- rownames(motif_stats[motif_stats$cluster_annot == current_motif_family,])
  current_motif_influences <- influence_scores[curr_motifs,]
  aggregate_motif_influences[i,] <- colSums(current_motif_influences)
}
# Add column names: pool names
rownames(aggregate_motif_influences) <- unique(motif_stats$cluster_annot)
colnames(aggregate_motif_influences) <- colnames(influence_scores)

(With the same method, you can also derive motif family scores averaged per cell type category, for example.)

Now that we have these values, we can visualize them with ggplot. First I β€œmelt” the dataframe, then I add cell type category information, and I use visualisation tools from the gghalves package.

aggregate_motif_influences_melt <- melt(aggregate_motif_influences)
colnames(aggregate_motif_influences_melt) <- c("motif_fam",
                                               "pool",
                                               "aggregate_influence")

aggregate_motif_influences_melt$celltype <- str_split_fixed(aggregate_motif_influences_melt$pool, "_", 2)[,1]
aggregate_motif_influences_melt$category <- Category_table[aggregate_motif_influences_melt$celltype]

ggplot(aggregate_motif_influences_melt, aes(x=category, y=aggregate_influence,
                                            fill=category)) +
  geom_half_boxplot() + geom_half_violin(side="r") + # from 'gghalves' package
  theme_Nice() + # from 'scover_helper_functions.R'
  scale_x_reordered() + # from 'tidytext' package 
  scale_fill_stata() + # from 'ggthemes' package
  labs(x="Category", y="Aggregate motif influence") +
  facet_wrap(~motif_fam, scales="free") 

PCA plots based on motif influence scores or expression values

To visualize the expression landscape or the regulatory landscape, it is useful to use dimensionality reduction. Here, I will use PCA to plot

  1. Motif influence scores across pools
  2. Expression values across pools

First, I will do the PCA based on motif influence scores:

pca_result <- prcomp(t(influence_scores))

# eigenvalues
eigs <- pca_result$sdev^2
variances_explained <- eigs/sum(sum(eigs))

pca_result <- as.data.frame(pca_result$x)

# See what the coordinates are:
print(pca_result[1:2,1:3])
##                            PC1         PC2         PC3
## Fetal B cell_pool1 -0.01976748 -0.03901322 0.002209844
## Fetal B cell_pool2 -0.01769229 -0.03388922 0.002284000
# Variances explained per PC
print(variances_explained[1:5])
## [1] 0.53261500 0.26351268 0.08142457 0.02166117 0.01295619

I can add cell type and cell type category annotation.

pca_result$Celltype <- colData(sce)[rownames(pca_result),]$cell_type1
pca_result$Category <- Category_table[pca_result$Celltype]

In this case, the dataset was made using two different datasets: fetal and mature kidney. I will add annotation for the origin of the pools this way (but this is specific to this analysis):

pca_result$Origin <- sapply(pca_result$Celltype, FUN=function(x){
  str_split(x, " ")[[1]][1] 
})

Now, I will plot the PCA plot, showing the cell type categories as colours, and the origin of the pools as the shapes:

# Function "scale_color_stata()" is from package ggthemes, 
# and function "geom_label_repel()" is from package ggrepel. 

ggplot(pca_result, aes(x=PC1, y=PC2, color=Category, shape=Origin)) + geom_point() +
  scale_color_stata() + theme_Nice(angled=FALSE) + theme(legend.position = "right") +
  labs(x=paste0("PC1 (", round(variances_explained[1]*100, 2), "%)"),
       y=paste0("PC2 (", round(variances_explained[2]*100, 2), "%)")) +
  coord_fixed() 

Now, I will do exactly the same, but based on expression values:

pca_result_expression <- prcomp(t(logcounts(sce)))
eigs_expression <- pca_result_expression$sdev^2
variances_explained_expression <- eigs_expression/sum(sum(eigs_expression))
pca_result_expression <- as.data.frame(pca_result_expression$x)

# See what the coordinates are:
print(pca_result_expression[1:2,1:3])
##                          PC1       PC2       PC3
## Fetal B cell_pool1 -84.77157 -94.90995 -7.852795
## Fetal B cell_pool2 -86.36306 -91.49847 -6.077714
# Variances explained per PC
print(variances_explained_expression[1:5])
## [1] 0.38369206 0.14193260 0.03773392 0.03144243 0.02420017
pca_result_expression$Celltype <- sce$cell_type1
pca_result_expression$Category <- sce$Category
pca_result_expression$Origin <- sapply(pca_result_expression$Celltype, 
                                       FUN=function(x){
                                         str_split(x, " ")[[1]][1]
                                       })

ggplot(pca_result_expression, aes(x=PC1, y=PC2, color=Category, shape=Origin)) + geom_point() +
  scale_color_stata() + theme_Nice(angled=FALSE) + theme(legend.position = "right") +
  labs(x=paste0("PC1 (", round(variances_explained_expression[1]*100, 2), "%)"),
       y=paste0("PC2 (", round(variances_explained_expression[2]*100, 2), "%)")) +
  coord_fixed() 

sessionInfo()

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] tidytext_0.2.6              plotly_4.9.2.1             
##  [3] pheatmap_1.0.12             ggrepel_0.8.2              
##  [5] gghalves_0.1.0              ggthemes_4.2.0             
##  [7] ggplot2_3.3.1               stringr_1.4.0              
##  [9] igraph_1.2.5                reshape2_1.4.4             
## [11] magrittr_1.5                dplyr_1.0.0                
## [13] SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.1
## [15] DelayedArray_0.12.3         BiocParallel_1.20.1        
## [17] matrixStats_0.56.0          Biobase_2.46.0             
## [19] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
## [21] IRanges_2.20.2              S4Vectors_0.24.4           
## [23] BiocGenerics_0.32.0         mixtools_1.2.0             
## [25] rhdf5_2.30.1               
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.1          httr_1.4.1             tidyr_1.1.0           
##  [4] jsonlite_1.6.1         viridisLite_0.3.0      splines_3.6.3         
##  [7] GenomeInfoDbData_1.2.2 yaml_2.2.1             pillar_1.4.4          
## [10] lattice_0.20-40        glue_1.4.1             digest_0.6.25         
## [13] RColorBrewer_1.1-2     XVector_0.26.0         colorspace_1.4-1      
## [16] htmltools_0.4.0        Matrix_1.2-18          plyr_1.8.6            
## [19] pkgconfig_2.0.3        zlibbioc_1.32.0        purrr_0.3.4           
## [22] scales_1.1.1           tibble_3.0.1           generics_0.0.2        
## [25] farver_2.0.3           ellipsis_0.3.1         withr_2.2.0           
## [28] lazyeval_0.2.2         survival_3.1-8         crayon_1.3.4          
## [31] evaluate_0.14          tokenizers_0.2.1       janeaustenr_0.1.5     
## [34] MASS_7.3-51.5          SnowballC_0.7.0        segmented_1.1-0       
## [37] tools_3.6.3            data.table_1.12.8      lifecycle_0.2.0       
## [40] Rhdf5lib_1.8.0         kernlab_0.9-29         munsell_0.5.0         
## [43] compiler_3.6.3         rlang_0.4.6            RCurl_1.98-1.2        
## [46] htmlwidgets_1.5.1      bitops_1.0-6           labeling_0.3          
## [49] rmarkdown_2.2          gtable_0.3.0           R6_2.4.1              
## [52] gridExtra_2.3          knitr_1.28             stringi_1.4.6         
## [55] Rcpp_1.0.4.6           vctrs_0.3.1            tidyselect_1.1.0      
## [58] xfun_0.14